Abstract We derive an invertible transform linking two widely used measures of species
diversity: phylogenetic diversity and the expected proportions of segregating (non-constant)
sites. We assume a bi-allelic, symmetric, finite site model of substitution. Like the Hadamard
transform of Hendy and Penny, the transform can be expressed completely independently
of the underlying phylogeny. Our results bridge work on diversity from two quite distinct
scientific communities.

1 Introduction

The quantification of biodiversity is one of the central conservation applications of molecu- 
lar ecology. There are, however, multiple ways to evaluate diversity, and different scientific 
communities have emphasised different measures. 

Among population geneticists, one of the two standard measures for diversity is the
proportion of segregating sites ($s$) (Watterson 1975). A site is segregating if it varies over
the sampled taxa, and the proportion of segregating sites is often used when estimating
population parameters. If $A$ is a subset of sampled taxa we let $s_A$ denote the probability
that a site varies over the taxa in $A$. We note that the other standard population genetic
measure of diversity, the expected pairwise divergence $\pi$, is then just the average of $s_A$
over all subsets $A$ of size two.

Among phylogeneticists it is now standard to incorporate the phylogeny into assessments
of diversity. Specifically, if $T$ is a phylogeny and $A$ is a subset of the taxa then the
phylogenetic diversity $\delta_A$ of $A$ with respect to $T$ is the sum of the branch lengths in
the smallest subtree of $T$ connecting $A$ (Faith 1992). One potential weakness of this measure
is its dependence on a specific phylogeny, a problem that can be at least partially addressed
by considering a distribution of trees (Minh et al 2009; Spillner et al 2008; Moulton et al
2007).
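
Returning to the two population genetic measures defined above, the following Python sketch
computes the empirical proportion of segregating sites over a subset of taxa, and the pairwise
divergence as its average over all pairs. The alignment `toy` and the function names are
hypothetical and chosen only for illustration; the empirical proportion is an estimate of the
probability $s_A$, not the probability itself.

```python
from itertools import combinations

def segregating_proportion(alignment, taxa):
    """Fraction of sites that are not constant over the chosen taxa.

    `alignment` maps a taxon name to a string of 0/1 states, one per site.
    """
    rows = [alignment[t] for t in taxa]
    n_sites = len(rows[0])
    varying = sum(1 for column in zip(*rows) if len(set(column)) > 1)
    return varying / n_sites

def pairwise_divergence(alignment):
    """Average of the segregating proportion over all pairs of taxa."""
    pairs = list(combinations(sorted(alignment), 2))
    return sum(segregating_proportion(alignment, p) for p in pairs) / len(pairs)

# Toy alignment (made up for illustration): 4 taxa, 8 bi-allelic sites.
toy = {"a": "00000000", "b": "01000001", "c": "01100000", "d": "00010000"}
```

Here `segregating_proportion(toy, "abcd")` plays the role of $s_A$ for $A$ the full taxon set,
while `pairwise_divergence(toy)` plays the role of $\pi$.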

Here we consider the assessment of diversity using bi-allelic sites from the same haplotype
block, that is, there is no recombination and all of the sites evolved on the same
phylogeny $T$. We will show that in this instance the segregating site probabilities $s_A$ and
the phylogenetic diversity measures $\delta_A$ are tightly linked. Specifically:

(i) If we know the probabilities $s_A$ for all subsets $A$ of even cardinality then we can
determine $s_A$ for subsets of any cardinality. Likewise, if we know $\delta_A$ for all subsets
$A$ of even cardinality then we can determine $\delta_A$ for all subsets of any cardinality.

(ii) When the mutation rates are symmetric, the segregating site probabilities and
phylogenetic diversities are linked by an invertible transform

$$\delta = -G \log(1 - G^{-1} s). \tag{1}$$

Here $G$ is a non-singular matrix independent of $T$, the logarithm is applied entrywise, and
$\delta$ and $s$ are vectors of phylogenetic diversity values and segregating site
probabilities, both indexed by subsets of even cardinality. The entries $G_{AB}$ of $G$ are
given by

$$G_{AB} = \begin{cases} 2^{-|A|+1}, & B \subseteq A; \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$

This transform can be used to relate phylogenetic diversity and segregating sites without any
knowledge of the true phylogeny.
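
To make the structure of (1) concrete, the following Python sketch applies the transform in
both directions. It is a minimal illustration under stated assumptions, not a reference
implementation: the function names are hypothetical, vectors are stored as dictionaries keyed
by non-empty even-cardinality subsets, and $G^{-1}$ is applied by forward substitution (which
works because $G_{AB}$ vanishes unless $B \subseteq A$) rather than through an explicit inverse.

```python
from itertools import combinations
from math import exp, log

def even_subsets(taxa):
    """Non-empty subsets of even cardinality, smallest first."""
    taxa = sorted(taxa)
    return [frozenset(c) for k in range(2, len(taxa) + 1, 2)
            for c in combinations(taxa, k)]

def apply_G(x, subsets):
    """(G x)_A = 2^(1-|A|) times the sum of x_B over indexed B contained in A."""
    return {A: 2.0 ** (1 - len(A)) * sum(x[B] for B in subsets if B <= A)
            for A in subsets}

def solve_G(y, subsets):
    """Solve G x = y by forward substitution; smaller subsets are handled first."""
    x = {}
    for A in subsets:
        x[A] = 2.0 ** (len(A) - 1) * y[A] - sum(x[B] for B in x if B < A)
    return x

def s_to_delta(s, taxa):
    """delta = -G log(1 - G^{-1} s); requires every entry of G^{-1} s to be below 1."""
    subsets = even_subsets(taxa)
    x = solve_G(s, subsets)
    return apply_G({A: -log(1.0 - x[A]) for A in subsets}, subsets)

def delta_to_s(delta, taxa):
    """Inverse direction: s = G (1 - exp(-G^{-1} delta))."""
    subsets = even_subsets(taxa)
    x = solve_G(delta, subsets)
    return apply_G({A: 1.0 - exp(-x[A]) for A in subsets}, subsets)
```

For two taxa the transform reduces to $\delta_{ab} = -\tfrac{1}{2}\log(1 - 2 s_{ab})$, the
familiar distance correction for the symmetric two-state model.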

The transform in equation (1) is therefore a diversity analogue of the elegant Hadamard
transform (Hendy and Penny 1989; Hendy 1989; Hendy and Penny 1993; Hendy 2003),
which maps branch weights in a tree to pattern probabilities. As could be expected, the two
transforms are closely related mathematically. We make use of Hendy's path-set approach
when we prove the correctness of our transform. Note that a version of (1) can be obtained by
transforming diversities to split weights (using Theorem 4 below), applying the Hadamard
transform, then transforming pattern probabilities to segregating site probabilities. This
approach introduces an undesirable asymmetry in the indexing of the transform because of the
need to specify a reference taxon in the Hadamard transform. We were unable to remove
this asymmetry, and instead derived a more direct result.

To illustrate the result, consider the four taxon phylogeny in Figure 1. A pattern is
simulated by selecting an (arbitrary) root node, choosing the state at the root uniformly at
random, and then evolving the states along branches away from the root. For each branch, if
$t$ is the length of the branch then the probability of a state change along that branch is
$\tfrac{1}{2}(1 - e^{-2t})$.

Fig. 1 Four taxon tree with branch lengths indicated.

Under this symmetric model, the probabilities of the various site patterns are

The phylogenetic diversity values for any subset can be computed directly from the tree
by summing the appropriate branch lengths, while the segregating site probabilities can be
computed from the site pattern probabilities. In this way we obtain

$$\delta = (\delta_{ab}, \delta_{ac}, \delta_{ad}, \delta_{bc}, \delta_{bd}, \delta_{cd}, \delta_{abcd})'
= (0.04, 0.04, 0.03, 0.04, 0.05, 0.05, 0.08)';$$

$$s = (s_{ab}, s_{ac}, s_{ad}, s_{bc}, s_{bd}, s_{cd}, s_{abcd})'
= (0.0384, 0.0384, 0.0291, 0.0384, 0.0476, 0.0476, 0.0762)'.$$

Our main result says that these two vectors are related by (1), where in this case (2) gives

$$G = \begin{pmatrix} \tfrac{1}{2} I_6 & 0 \\ \tfrac{1}{8} \mathbf{1}' & \tfrac{1}{8} \end{pmatrix},$$

with $I_6$ the $6 \times 6$ identity matrix and $\mathbf{1}$ the all-ones vector of length six.
A quick calculation validates the formula in this instance.
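
The quick calculation can be sketched in Python as follows. Since the figure is not
reproduced here, the topology and branch lengths in `EDGES` are an assumption, inferred from
the diversity values quoted above (pendant branches 0.01 for $a$ and 0.02 for $b$, $c$, $d$,
with an internal branch of 0.01 separating $\{a,d\}$ from $\{b,c\}$). The names `u`, `v` and
all function names are hypothetical. Pattern probabilities are obtained by enumerating which
branches carry a state change, and the last lines specialise (1) to four taxa.

```python
from itertools import combinations, product
from math import exp, log

# Assumed tree: a and d attach to internal node u, b and c attach to internal
# node v, and u-v is the internal branch. Edges are listed parent-first.
EDGES = {("u", "a"): 0.01, ("u", "d"): 0.02, ("u", "v"): 0.01,
         ("v", "b"): 0.02, ("v", "c"): 0.02}
LEAVES = "abcd"

def pattern_probabilities():
    """Exact leaf-pattern probabilities, rooting the tree at u."""
    probs = {}
    for root_state in (0, 1):
        for flips in product((0, 1), repeat=len(EDGES)):
            state, weight = {"u": root_state}, 0.5
            for ((parent, child), t), flip in zip(EDGES.items(), flips):
                change = 0.5 * (1.0 - exp(-2.0 * t))  # P(state change on branch)
                weight *= change if flip else 1.0 - change
                state[child] = state[parent] ^ flip
            pattern = tuple(state[x] for x in LEAVES)
            probs[pattern] = probs.get(pattern, 0.0) + weight
    return probs

def segregating(probs, subset):
    """Probability that a pattern is not constant over `subset`."""
    idx = [LEAVES.index(x) for x in subset]
    return sum(p for pat, p in probs.items() if len({pat[i] for i in idx}) > 1)

probs = pattern_probabilities()
pairs = ["".join(c) for c in combinations(LEAVES, 2)]
s = {A: segregating(probs, A) for A in pairs + [LEAVES]}

# Four-taxon specialisation of (1): solve G x = s, then delta = -G log(1 - x).
x = {A: 2.0 * s[A] for A in pairs}
x[LEAVES] = 8.0 * s[LEAVES] - sum(x[A] for A in pairs)
delta = {A: -0.5 * log(1.0 - x[A]) for A in pairs}
delta[LEAVES] = -sum(log(1.0 - x[A]) for A in x) / 8.0
```

Under these assumptions the dictionary `s` agrees with the quoted segregating site
probabilities to the four decimals shown, and `delta` recovers the quoted diversities.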

The paper is structured as follows. The next section introduces some basic properties
relating segregating site probabilities to site pattern probabilities. Section 3 shows that these
properties have analogues relating phylogenetic diversity and branch lengths. Section 4
provides the bridge connecting segregating sites and phylogenetic diversity.

2 Properties of segregating site probabilities

Let $X$ denote the set of taxa. For any $A \subseteq X$, let $p_A$ denote the probability that,
for a given site, the taxa in $A$ have state 1 and the taxa in $\bar{A} = X \setminus A$ have
state 0. At this point, we will not be making any assumptions about the probabilities $p_A$
beyond that they are non-negative and $\sum_{A \subseteq X} p_A = 1$.

Let $\bar{p}$ denote the symmetrised version of $p$, defined by

$$\bar{p}_A = (p_A + p_{\bar{A}})/2.$$

A site is segregating over a subset $A \subseteq X$ if it is not constant over $A$. Hence if
$B$ is the set of taxa with state 1 for some site then the site is segregating if and only if
$A \cap B$ and $A \cap \bar{B}$ are both nonempty. We therefore define the collection

$$\mathcal{S}_A = \{B : A \cap B \neq \emptyset \text{ and } A \cap \bar{B} \neq \emptyset\}$$

so that the probability that a randomly chosen site is segregating over $A$ is given by

$$s_A = \sum_{B \in \mathcal{S}_A} p_B. \tag{3}$$

Since $B \in \mathcal{S}_A$ if and only if $\bar{B} \in \mathcal{S}_A$ we also have

$$s_A = \sum_{B \in \mathcal{S}_A} (p_B + p_{\bar{B}})/2 = \sum_{B \in \mathcal{S}_A} \bar{p}_B. \tag{4}$$

Writing $n = |X|$, there are $2^{n-1} - 1$ degrees of freedom for the symmetrised probabilities
$\bar{p}$. In contrast, there are $2^n - n - 1$ values for $s$, one for every subset with
cardinality at least two. Hence there must be a great deal of redundancy in the segregating
site probabilities. We show here that the probabilities $s_A$ for all $A \subseteq X$ are
determined by the probabilities $s_A$ for $A$ with even cardinality, so that segregating site
probabilities have the same degrees of freedom as the symmetrised site pattern probabilities
$\bar{p}$.

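We will make use of the tangent numbers $T_k$, which are defined by the power series

$$\tanh(x) = \sum_{k=0}^{\infty} T_k \frac{x^k}{k!}$$

and are related to the better known Bernoulli numbers via the identity

$$T_k = 2^{k+1}(2^{k+1} - 1) B_{k+1}/(k+1),$$

see (Cohen 2007, pg. 6-7).

Theorem 1 1. For all $A \subseteq X$,

$$s_A = \sum_{B \subseteq A} (-1)^{|B|} s_B. \tag{5}$$

2. If $|A|$ is odd then $s_A$ is determined by $s_B$ values for $|B|$ even, by

$$s_A = \sum_{B \subset A,\ |B| \text{ even}} \frac{T_{|A|-|B|}}{2^{|A|-|B|}} \, s_B. \tag{6}$$

Proof 1. Suppose that $A \subseteq X$. Then we get

$$\sum_{B \subseteq A} (-1)^{|B|} s_B
= \sum_{B \subseteq A,\ |B| \text{ even}} s_B - \sum_{B \subseteq A,\ |B| \text{ odd}} s_B
= \sum_{B \subseteq A,\ |B| \text{ even}} \ \sum_{C \in \mathcal{S}_B} \bar{p}_C
- \sum_{B \subseteq A,\ |B| \text{ odd}} \ \sum_{C \in \mathcal{S}_B} \bar{p}_C.$$

Consider any $C \subseteq X$. If $A \cap C = \emptyset$ or $A \cap \bar{C} = \emptyset$ then
there is no $B \subseteq A$ such that $C \in \mathcal{S}_B$. Otherwise, suppose that
$C \in \mathcal{S}_A$, let $k = |A \cap C|$ and $l = |A \setminus C|$.

The number of subsets $B \subseteq A$ such that $|B|$ is even and $C \in \mathcal{S}_B$ is then

$$2^{k-1} 2^{l-1} + (2^{k-1} - 1)(2^{l-1} - 1) = 2^{k+l-1} - 2^{k-1} - 2^{l-1} + 1$$

which follows by looking at the possible (non-empty) intersections of $B$ with $A \setminus C$
and $A \cap C$. The number of subsets $B \subseteq A$ such that $|B|$ is odd and
$C \in \mathcal{S}_B$ is

$$2^{k-1}(2^{l-1} - 1) + (2^{k-1} - 1) 2^{l-1} = 2^{k+l-1} - 2^{k-1} - 2^{l-1}.$$

Hence

$$\sum_{B \subseteq A} (-1)^{|B|} s_B
= \sum_{C \in \mathcal{S}_A} (2^{k+l-1} - 2^{k-1} - 2^{l-1} + 1)\, \bar{p}_C
- \sum_{C \in \mathcal{S}_A} (2^{k+l-1} - 2^{k-1} - 2^{l-1})\, \bar{p}_C
= \sum_{C \in \mathcal{S}_A} \bar{p}_C
= s_A.$$

2. The result holds trivially when $|A| = 1$. Suppose $|A| = 2r + 1$ and the result holds for
sets of odd cardinality less than $|A|$. Then from part 1,

$$s_A = \frac{1}{2} \sum_{B \subset A,\ |B| \text{ even}} s_B
- \frac{1}{2} \sum_{C \subsetneq A,\ |C| \text{ odd}} s_C$$

$$= \frac{1}{2} \sum_{B \subset A,\ |B| \text{ even}} s_B
- \frac{1}{2} \sum_{C \subsetneq A,\ |C| \text{ odd}} \ \sum_{B \subseteq C,\ |B| \text{ even}}
\frac{T_{|C|-|B|}}{2^{|C|-|B|}} \, s_B$$

$$= \frac{1}{2} \sum_{k=0}^{r} \ \sum_{B \subset A,\ |B| = 2k}
\left[ 1 - \sum_{j=k}^{r-1} \ \sum_{C : B \subseteq C \subset A,\ |C| = 2j+1}
\frac{T_{2(j-k)+1}}{2^{2(j-k)+1}} \right] s_B$$

$$= \frac{1}{2} \sum_{k=0}^{r} \ \sum_{B \subset A,\ |B| = 2k}
\left[ 1 - \sum_{j=k}^{r-1} \binom{2(r-k)+1}{2(j-k)+1}
\frac{T_{2(j-k)+1}}{2^{2(j-k)+1}} \right] s_B.$$

The identity

$$\frac{T_{2(r-k)+1}}{2^{2(r-k)+1}} = \frac{1}{2} - \frac{1}{2} \sum_{j=0}^{r-k-1}
\binom{2(r-k)+1}{2j+1} \frac{T_{2j+1}}{2^{2j+1}}$$

can be proven by substituting the power series

$$\tanh(x) = \sum_{k=0}^{\infty} \frac{x^{2k+1}}{(2k+1)!} \, T_{2k+1}$$

into the expression

$$\tanh(x/2) = \frac{e^x - 1}{e^x + 1}$$

and equating coefficients. Applying the identity to the bracketed term gives (6).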
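
Both parts of Theorem 1 can be checked numerically. The Python sketch below is an
illustration only: the function names are hypothetical, subsets are represented as frozensets,
and the tangent numbers are hard-coded for $|A| \leq 5$ from the series
$\tanh(x) = x - x^3/3 + 2x^5/15 - \cdots$.

```python
from itertools import combinations

def subsets(items):
    """All subsets of `items` as frozensets, smallest first."""
    items = sorted(items)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]

def segregating_probs(p, X):
    """s_A for every A: total probability of patterns B that split A, as in (3)."""
    return {A: sum(q for B, q in p.items() if A & B and A - B)
            for A in subsets(X)}

def alternating_sum(s, A):
    """Right-hand side of (5)."""
    return sum((-1) ** len(B) * s[B] for B in subsets(A))

# T_1, T_3, T_5, read off from the tanh series above.
TANGENT = {1: 1, 3: -2, 5: 16}

def odd_from_even(s, A):
    """Right-hand side of (6); usable here for |A| in {1, 3, 5}."""
    return sum(TANGENT[len(A) - len(B)] / 2 ** (len(A) - len(B)) * s[B]
               for B in subsets(A) if len(B) % 2 == 0)
```

For three taxa, `odd_from_even` reduces to $s_{abc} = (s_{ab} + s_{ac} + s_{bc})/2$.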

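Having shown that the $s$ values and $\bar{p}$ values have the same degrees of freedom, we now
give an explicit formula for the $\bar{p}$ values in terms of all $s_B$ values. A consequence
of this result is that, in the case of symmetric mutation rates, the segregating site
probabilities are (trivially) sufficient statistics for inferring trees or population
parameters.

Theorem 2 For all $\emptyset \neq B \subsetneq X$ we have

$$\bar{p}_B = \frac{1}{2} \sum_{A : B \subseteq A} (-1)^{|A|-|B|+1} s_A. \tag{7}$$

Proof Suppose that $\emptyset \neq B \subsetneq X$. Every site segregating over
$X \setminus B$ is segregating over $X$. Conversely, a site is segregating over $X$ but not
over $X \setminus B$ exactly if all the taxa with ones for the site are contained in $B$ or all
the taxa with zeros are contained in $B$. Hence

$$s_X - s_{X \setminus B} = \sum_{\emptyset \neq V \subseteq B} 2 \bar{p}_V.$$

Applying Möbius inversion we have for all $\emptyset \neq V \subsetneq X$ that

$$\bar{p}_V = \frac{1}{2} \sum_{B \subseteq V} (-1)^{|V|-|B|} (s_X - s_{X \setminus B})
= \frac{1}{2} \sum_{B \subseteq V} (-1)^{|V|-|B|+1} s_{X \setminus B}
= \frac{1}{2} \sum_{A \supseteq U} (-1)^{|A|-|U|+1} s_A,$$

this last step given by the substitution $A = X \setminus B$, with $U = X \setminus V$. Since
$\bar{p}_V = \bar{p}_U$, this is (7).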
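
As a small numerical illustration of Theorem 2, the Python sketch below recovers the
symmetrised pattern probabilities from the segregating site probabilities on three taxa. The
pattern distribution `p` is hypothetical (weights $1, \ldots, 8$ normalised to sum to one), as
are the function names.

```python
from itertools import combinations

def subsets(items):
    """All subsets of `items` as frozensets, smallest first."""
    items = sorted(items)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]

def symmetrised_from_s(s, B, X):
    """Right-hand side of (7), for B non-empty and not equal to X."""
    return 0.5 * sum((-1) ** (len(A) - len(B) + 1) * s[A]
                     for A in subsets(X) if B <= A)

# Hypothetical pattern distribution on X = {a, b, c}.
X = frozenset("abc")
p = {B: (i + 1) / 36 for i, B in enumerate(subsets(X))}
# Segregating site probabilities from equation (3).
s = {A: sum(q for B, q in p.items() if A & B and A - B) for A in subsets(X)}
```

For each non-empty proper subset `B`, the value `symmetrised_from_s(s, B, X)` should agree
with `(p[B] + p[X - B]) / 2` up to rounding.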



3 Properties of phylogenetic diversities 

Let $T$ be a phylogeny with branch lengths and leaf set $X$. We defined the phylogenetic
diversity of a subset $A \subseteq X$ as the length of the smallest subtree of $T$ connecting
the leaves in $A$, where the length of a subtree is the sum of all its branch lengths. Minh et
al (2006) and Moulton et al (2007) showed that phylogenetic diversity could also,
conveniently, be defined in terms of splits.

A split of $X$ is a bipartition of $X$ into two parts, where here we will also permit the
trivial bipartition $\emptyset | X$. We regard $U|V$ and $V|U$ as the same split and let
$\Sigma_X$ denote the set of all splits of $X$. Deleting an edge $e$ in a phylogenetic tree
induces a split $\sigma_e$ of $X$ given by the leaf sets of the resulting two components. The
set of all splits of a tree $T$ obtained in this way is denoted $\Sigma(T)$. A split weight
function is a map $w : \Sigma_X \to \mathbb{R}_{\geq 0}$. The split weight function for
a phylogenetic tree with branch lengths is defined by

$$w_{U|V} = \begin{cases} b_e & \text{if } U|V = \sigma_e \text{ for some edge } e \in E(T)
\text{ with length } b_e; \\ 0 & \text{otherwise.} \end{cases}$$

See Chapter 3 of Semple and Steel (2003) for more on splits and their uses in phylogenetic
combinatorics.

Moulton et al (2007) and Minh et al (2009) showed that if $w$ is the split weight function
corresponding to a tree with taxon set $X$, and $A \subseteq X$, then

$$\delta_A = \sum_{U|V \in \Sigma_X :\ U \cap A \neq \emptyset,\ V \cap A \neq \emptyset} w_{U|V}. \tag{9}$$

Actually, this formulation applies to any split weight function, allowing an extension of
phylogenetic diversity to phylogenetic networks.
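
The split formulation (9) translates directly into code. In the Python sketch below, each
split is stored by one of its two sides, and the weights in `SPLITS` describe a hypothetical
four-taxon tree (pendant branches of length 0.01 for $a$ and 0.02 for $b$, $c$, $d$, and an
internal branch of length 0.01 separating $\{a,d\}$ from $\{b,c\}$); the function name is
likewise an assumption made for illustration.

```python
def phylogenetic_diversity(split_weights, leaves, subset):
    """Sum the weights of all splits U|V with both U and V meeting `subset`, as in (9)."""
    leaves, subset = frozenset(leaves), frozenset(subset)
    return sum(weight for side, weight in split_weights.items()
               if subset & side and subset & (leaves - side))

# Hypothetical four-taxon tree; each key is one side of a split.
SPLITS = {frozenset("a"): 0.01, frozenset("b"): 0.02, frozenset("c"): 0.02,
          frozenset("d"): 0.02, frozenset("ad"): 0.01}
```

Because only the split weights enter, the same function applies unchanged to weighted split
systems that do not come from a tree.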

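As we shall see, the mathematics of phylogenetic diversity is in many ways dual to the
mathematics of segregating sites. As before, we define for $A \subseteq X$ the set
$\mathcal{S}_A = \{B \subseteq X : B \cap A \neq \emptyset,\ \bar{B} \cap A \neq \emptyset\}$.
Let $\bar{w}$ be the function from subsets of $X$ to real numbers given by

$$\bar{w}_A = w_{A|\bar{A}}/2.$$

We then have from (9) that

$$\delta_A = \sum_{B \in \mathcal{S}_A} \bar{w}_B. \tag{10}$$

This is, of course, the same as (4) with a change of labels. Hence we immediately obtain the
analogues to Theorem 1 and Theorem 2.

Theorem 3 1. For all $A \subseteq X$,

$$\delta_A = \sum_{B \subseteq A} (-1)^{|B|} \delta_B. \tag{11}$$

2. For all $A \subseteq X$ with odd cardinality,

$$\delta_A = \sum_{B \subset A :\ |B| \text{ even}} \frac{T_{|A|-|B|}}{2^{|A|-|B|}} \, \delta_B. \tag{12}$$

Theorem 4 For all $\emptyset \neq B \subsetneq X$ we have

$$\bar{w}_B = \frac{1}{2} \sum_{A : B \subseteq A} (-1)^{|A|-|B|+1} \delta_A. \tag{13}$$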



4 An invertible transform for segregating sites and diversities 

We have seen how the phylogenetic diversity and segregating site probabilities have a similar
structure. Here we formally establish a link between the two, one that works irrespective of
the underlying phylogeny. The transform takes a vector of $s_A$ values and returns a vector of
$\delta_A$ values, and does so via intermediate values $\gamma$ and $\mu$ which we now define.

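Theorem 5 1. For each $A \subseteq X$ define

$$\gamma_A = \begin{cases} \sum_{B :\ |A \cap B| \text{ odd}} \bar{p}_B & \text{if } |A| \text{ is even;} \\
0 & \text{if } |A| \text{ is odd.} \end{cases}$$

Then

$$\gamma_A = \sum_{B \subseteq A} (-2)^{|B|-2} s_B; \tag{14}$$

$$s_B = \frac{1}{2^{|B|-2}} \sum_{A \subseteq B} \gamma_A. \tag{15}$$

2. For each $A \subseteq X$ define

$$\mu_A = \begin{cases} \sum_{B :\ |A \cap B| \text{ odd}} \bar{w}_B & \text{if } |A| \text{ is even;} \\
0 & \text{if } |A| \text{ is odd.} \end{cases}$$

Then

$$\mu_A = \sum_{B \subseteq A} (-2)^{|B|-2} \delta_B; \tag{16}$$

$$\delta_B = \frac{1}{2^{|B|-2}} \sum_{A \subseteq B} \mu_A. \tag{17}$$

Proof Substituting (4) into the right hand side of (14) we obtain

$$\sum_{B \subseteq A} (-2)^{|B|-2} s_B
= \sum_{B \subseteq A} (-2)^{|B|-2} \sum_{C \in \mathcal{S}_B} \bar{p}_C
= \sum_{C \subseteq X} \left( \sum_{B \subseteq A :\ B \cap C \neq \emptyset,\ B \cap \bar{C} \neq \emptyset}
(-2)^{|B|-2} \right) \bar{p}_C.$$

The number of subsets $B \subseteq A$ of cardinality $|B| = k$ such that
$B \cap (A \cap C) \neq \emptyset$ and $B \cap (A \setminus C) \neq \emptyset$ equals
$\binom{|A|}{k} - \binom{|A \cap C|}{k} - \binom{|A \setminus C|}{k}$. By applying the binomial
theorem three times, we obtain

$$\sum_{B \subseteq A :\ B \cap C \neq \emptyset,\ B \cap \bar{C} \neq \emptyset} (-2)^{|B|-2}
= \sum_{k=2}^{|A|} (-2)^{k-2} \left[ \binom{|A|}{k} - \binom{|A \cap C|}{k} - \binom{|A \setminus C|}{k} \right]
= \frac{1}{4} \left( 1 + (-1)^{|A|} - (-1)^{|A \cap C|} - (-1)^{|A \setminus C|} \right). \tag{18}$$

If $|A|$ is odd, or if $|A|$ is even and $|A \cap C|$ is even, then (18) becomes zero. If $|A|$
is even and $|A \cap C|$ is odd, then (18) evaluates to 1. This proves (14). The inverse
relation (15) now follows by applying Möbius inversion to (14), using $\gamma_A = 0$ for $|A|$
odd.

The identities (16) and (17) are proved in the same way.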
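
The identities (14) and (15) are easy to check numerically. The Python sketch below is an
illustration with hypothetical function names. It computes $\gamma_A$ directly from a pattern
distribution, using unsymmetrised probabilities $p_B$, which give the same sum as $\bar{p}_B$
when $|A|$ is even, and compares the result with the two formulas.

```python
from itertools import combinations

def subsets(items):
    """All subsets of `items` as frozensets, smallest first."""
    items = sorted(items)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]

def gamma_direct(p, A):
    """P(an odd number of taxa in A have state 1) for even |A|; zero for odd |A|."""
    if len(A) % 2 == 1:
        return 0.0
    return sum(q for B, q in p.items() if len(A & B) % 2 == 1)

def gamma_from_s(s, A):
    """Right-hand side of (14)."""
    return sum((-2.0) ** (len(B) - 2) * s[B] for B in subsets(A))

def s_from_gamma(gamma, B):
    """Right-hand side of (15)."""
    return sum(gamma[A] for A in subsets(B)) / 2.0 ** (len(B) - 2)
```

For a pair $A = \{a, b\}$ both formulas collapse to $\gamma_A = s_A$, since an odd number of
ones among two taxa means exactly that the two states differ.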



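Combining Theorem 5 and Theorem 1 we see that for each even cardinality set $A$, the
value $\gamma_A$ is a linear function of the values $s_B$ with $|B|$ even. Likewise, each value
$\mu_A$ is a linear function of the values $\delta_B$ with $|B|$ even. The following theorem
makes the relationship explicit. We make use of the Euler numbers $E_k$, which are defined by
the generating function

$$\frac{1}{\cosh(x)} = \frac{2}{e^x + e^{-x}} = \sum_{k=0}^{\infty} E_k \frac{x^k}{k!},$$

see (Cohen 2007, pg. 7).

Theorem 6 If $|A|$ is even then

$$\gamma_A = \frac{1}{4} \sum_{B \subseteq A,\ |B| \text{ even}} 2^{|B|} E_{|A|-|B|} \, s_B, \tag{19}$$

$$\mu_A = \frac{1}{4} \sum_{B \subseteq A,\ |B| \text{ even}} 2^{|B|} E_{|A|-|B|} \, \delta_B. \tag{20}$$

Proof Applying Theorem 5 and Theorem 1 we obtain

$$\gamma_A = \sum_{B \subseteq A} (-2)^{|B|-2} s_B
= \sum_{B \subseteq A,\ |B| \text{ even}} 2^{|B|-2} s_B - \sum_{C \subseteq A,\ |C| \text{ odd}} 2^{|C|-2} s_C$$

$$= \sum_{B \subseteq A,\ |B| \text{ even}} 2^{|B|-2} s_B
- \sum_{C \subseteq A,\ |C| \text{ odd}} 2^{|C|-2} \sum_{B \subseteq C,\ |B| \text{ even}}
\frac{T_{|C|-|B|}}{2^{|C|-|B|}} \, s_B$$

$$= \sum_{B \subseteq A,\ |B| \text{ even}}
\left[ 1 - \sum_{C :\ B \subseteq C \subseteq A,\ |C| \text{ odd}} T_{|C|-|B|} \right] 2^{|B|-2} s_B.$$

If $|A| = 2r$ and $|B| = 2k$ then

$$1 - \sum_{C :\ B \subseteq C \subseteq A,\ |C| \text{ odd}} T_{|C|-|B|}
= 1 - \sum_{j=k}^{r-1} \binom{2(r-k)}{2(j-k)+1} T_{2(j-k)+1}. \tag{21}$$

Here we can apply the identity

$$E_{2(r-k)} = 1 - \sum_{j=0}^{r-k-1} \binom{2(r-k)}{2j+1} T_{2j+1}, \tag{22}$$

which is proven by substituting the generating functions

$$\tanh(x) = \sum_{k=0}^{\infty} T_k \frac{x^k}{k!}, \qquad
\frac{2}{e^x + e^{-x}} = \sum_{k=0}^{\infty} E_k \frac{x^k}{k!}$$

into the expression

$$(1 - \tanh(x))\, e^x = \frac{1}{\cosh(x)}.$$

Applying (22) to (21), and using the fact that $T_k = 0$ for all even $k$, we obtain

$$\sum_{j=k}^{r-1} \binom{2(r-k)}{2(j-k)+1} T_{2(j-k)+1} = 1 - E_{2(r-k)},$$

proving the theorem.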
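
The number identities used in this proof can be checked with exact integer arithmetic. The
Python sketch below is illustrative and the function names are hypothetical. It generates
tangent numbers from the relation $(e^x + 1)\tanh(x/2) = e^x - 1$, generates Euler numbers
from $\cosh(x) \cdot (1/\cosh(x)) = 1$, and tests the identity (22) with $m$ in place of
$r - k$.

```python
from fractions import Fraction
from math import comb

def tangent_numbers(n_max):
    """T_0..T_n_max, using a_n = T_n / 2^n and 2 a_n + sum_{j<n} C(n,j) a_j = 1."""
    a = [Fraction(0)]
    for n in range(1, n_max + 1):
        a.append((1 - sum(comb(n, j) * a[j] for j in range(n))) / 2)
    return [int(a_n * 2 ** n) for n, a_n in enumerate(a)]

def euler_numbers(n_max):
    """E_0..E_n_max, using sum over even j of C(n,j) E_{n-j} = 0 for n >= 1."""
    E = [1]
    for n in range(1, n_max + 1):
        E.append(-sum(comb(n, j) * E[n - j] for j in range(2, n + 1, 2)))
    return E

def identity_22_holds(m, T, E):
    """Check E_{2m} = 1 - sum_{j<m} C(2m, 2j+1) T_{2j+1}."""
    return E[2 * m] == 1 - sum(comb(2 * m, 2 * j + 1) * T[2 * j + 1]
                               for j in range(m))
```

With these conventions the first tangent numbers are $T_1 = 1$, $T_3 = -2$, $T_5 = 16$ and the
first Euler numbers are $E_0 = 1$, $E_2 = -1$, $E_4 = 5$.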



The final link in the transform is the map between $\gamma$ and $\mu$. Let $T$ be the
underlying phylogenetic tree, so that $\delta_A$ is the length of the minimal subtree
connecting $A$ in $T$, and $s_A$ is the probability that a site generated on $T$ is not
constant over $A$.

A path-set of $T$ is the set of edges in the disjoint union of a set of leaf-to-leaf paths of
$T$. Path-sets were introduced by Hendy and Penny (1993) when proving the correctness of
the Hadamard transform, and we will make use of several of their results (see also Hendy
2005).

Consider a path-set with set of endpoints $A$, noting that a path-set is uniquely determined
by its set of endpoints. The sum of edge lengths in the path-set, or the length of the
path-set, equals the sum of the split weights for all splits $U|V$ in the tree such that
$|U \cap A|$ and $|V \cap A|$ are both odd. Hence, the length of the path-set is $\mu_A$.

The probability that an odd number of taxa in $A$ have state 1 is the sum of probabilities
$p_U$ over all $U$ with $|U \cap A|$ odd. Hence, this probability equals $\gamma_A$.

The core of the Hadamard transform is a formula connecting the length of a path-set
(in our case, $\mu_A$) and the probability that a site assigns 1 to an odd number of taxa in
the endpoints of the path-set (in our case, $\gamma_A$). From Hendy and Penny (1993) we obtain

$$\mu_A = -\tfrac{1}{2} \log(1 - 2\gamma_A). \tag{23}$$

We now have invertible transforms from $\delta$ to $\mu$ to $\gamma$ to $s$. The following
theorem makes the composite transform explicit.

Theorem 7 Let $\delta$ and $s$ denote the vectors of $\delta_A$ and $s_A$ values, where $A$
ranges over non-empty subsets of $X$ with even cardinality. Then

$$\delta = -G \log(1 - G^{-1} s),$$

where

$$G_{AB} = \begin{cases} 2^{-|A|+1}, & B \subseteq A; \\ 0, & \text{otherwise,} \end{cases}$$

and

$$(G^{-1})_{AB} = \begin{cases} 2^{|B|-1} E_{|A|-|B|}, & B \subseteq A; \\ 0, & \text{otherwise.} \end{cases}$$

Proof From Theorem 6, the above review of path-set results in Hendy and Penny (1993),
and Theorem 5, we have for all $A \subseteq X$ with $|A|$ even,

$$\gamma_A = \frac{1}{4} \sum_{B \subseteq A,\ |B| \text{ even}} 2^{|B|} E_{|A|-|B|} \, s_B$$

$$\mu_A = -\tfrac{1}{2} \log(1 - 2\gamma_A)$$

$$\delta_A = \frac{1}{2^{|A|-2}} \sum_{B \subseteq A,\ |B| \text{ even}} \mu_B.$$
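
As a sanity check on the two matrices in Theorem 7, the following Python sketch verifies with
exact rational arithmetic that the stated entries of $G^{-1}$ do invert $G$ for up to six taxa.
It is an illustration only: the function names are hypothetical and the Euler numbers
$E_0 = 1$, $E_2 = -1$, $E_4 = 5$ are hard-coded.

```python
from fractions import Fraction
from itertools import combinations

EULER = {0: 1, 2: -1, 4: 5}   # E_0, E_2, E_4: enough for at most six taxa

def even_subsets(taxa):
    """Non-empty subsets of even cardinality, smallest first."""
    taxa = sorted(taxa)
    return [frozenset(c) for k in range(2, len(taxa) + 1, 2)
            for c in combinations(taxa, k)]

def G_entry(A, B):
    """G_AB = 2^(-|A|+1) if B is contained in A, else 0."""
    return Fraction(1, 2 ** (len(A) - 1)) if B <= A else Fraction(0)

def G_inverse_entry(A, B):
    """(G^-1)_AB = 2^(|B|-1) E_(|A|-|B|) if B is contained in A, else 0."""
    if not B <= A:
        return Fraction(0)
    return Fraction(2 ** (len(B) - 1) * EULER[len(A) - len(B)])

def product_entry(A, C, subsets):
    """Entry (A, C) of the matrix product G G^-1."""
    return sum(G_entry(A, B) * G_inverse_entry(B, C) for B in subsets)
```

For example, with `subs = even_subsets("abcdef")`, the value `product_entry(A, C, subs)` should
be 1 when `A == C` and 0 otherwise.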